Abstract. Two words $w_1$ and $w_2$ are said to be $k$-binomial equivalent
if every non-empty word $x$ of length at most $k$ over the alphabet of $w_1$
and $w_2$ appears as a scattered factor of $w_1$ exactly as many times as it
appears as a scattered factor of $w_2$. We give two different polynomial-time
algorithms testing the $k$-binomial equivalence of two words. The first one
is deterministic (but the degree of the corresponding polynomial is too
high) while the second one is randomised (but more direct and efficient).


1 Introduction 

An alphabet is a finite and nonempty set of symbols (also called letters). Any
finite sequence of symbols from an alphabet $\Sigma$ is called a word over
$\Sigma$. The set of all words over $\Sigma$ is denoted by $\Sigma^*$ and the
empty word is denoted by $\varepsilon$; also $\Sigma^+$ is the set of non-empty
words over $\Sigma$, $\Sigma^k$ is the set of all words over $\Sigma$ of length
exactly $k$, while $\Sigma^{\le k}$ is the set of all words over $\Sigma$ of
length at most $k$. Given a word $w$ over an alphabet $\Sigma$, we denote by
$|w|$ its length; for some $1 \le i \le |w|$ we denote the $i$-th letter of $w$
by $w[i]$. We also denote the factor that starts with the $i$-th letter and
ends with the $j$-th letter in $w$ by $w[i..j]$. For $w, x \in \Sigma^+$ we
denote by $|w|_x$ the number of distinct occurrences of $x$ as a factor of $w$.

A scattered factor of $w \in \Sigma^*$ is a word $w[i_1] \cdots w[i_k]$ for
some $k \ge 1$ such that $i_j < i_{j+1}$ for all $1 \le j \le k-1$. The
binomial coefficient of $u$ and $v$, denoted $\binom{u}{v}$, equals the number
of occurrences of $v$ as a scattered factor of $u$. Clearly, for $a \in \Sigma$
we have $\binom{u}{a} = |u|_a$, while for $x \in \Sigma^+$ with $|x| \ge 2$ it
is not necessary that $|u|_x = \binom{u}{x}$. For example, if $u = bbaa$ and
$v = ba$ we have $\binom{u}{v} = \binom{bbaa}{ba} = 4$, as
$u[1]u[3] = u[2]u[3] = u[1]u[4] = u[2]u[4] = ba$; clearly, $|u|_{ba} = 1$.

* The results presented in this paper were partly obtained during the Dagstuhl
seminar 14111, in March 2014. Dominik D. Freydenberger was supported by the DFG
grant FR 3551/1-1. Paweł Gawrychowski is currently holding a post-doctoral
position at Warsaw Center of Mathematics and Computer Science. Juhani Karhumäki
was supported by Academy of Finland under the grant 257857. Florin Manea was
supported by the DFG grant 596676. Wojciech Rytter was supported by the grant
NCN2014/13/B/ST6/00770 of the Polish Science Center.

For more details regarding these binomial coefficients see Chapter 6, by 
Sakarovitch and Simon, from the handbook [8]. 
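
The difference between the two ways of counting can be made concrete with a
short Python sketch. The function names `binomial` and `factor_occurrences`
are ours, not the paper's, and the dynamic programme is the standard
subsequence-counting recurrence rather than a method taken from [8].

```python
def binomial(u: str, v: str) -> int:
    """Number of occurrences of v as a scattered factor (subsequence) of u."""
    # ways[j] = number of ways to embed v[:j] into the part of u read so far.
    ways = [1] + [0] * len(v)
    for c in u:
        # Go backwards so that the current letter of u is used at most once
        # in each embedding.
        for j in range(len(v), 0, -1):
            if v[j - 1] == c:
                ways[j] += ways[j - 1]
    return ways[len(v)]


def factor_occurrences(u: str, v: str) -> int:
    """Number of (possibly overlapping) occurrences of v as a factor of u."""
    return sum(1 for i in range(len(u) - len(v) + 1) if u.startswith(v, i))


print(binomial("bbaa", "ba"), factor_occurrences("bbaa", "ba"))
```

For $u = bbaa$ and $v = ba$ the two functions return 4 and 1, respectively,
in line with the example above.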

A well known equivalence relation between words is that of abelian equivalence.
Two words $w_1, w_2 \in \Sigma^*$ are said to be abelian equivalent if for all
$a \in \Sigma$ we have $|w_1|_a = |w_2|_a$; equivalently, $w_1$ and $w_2$ are
abelian equivalent if they have the same Parikh vector, thus being permutations
of each other. This relation was extended in [7] (see also […]), where the
$k$-abelian equivalence relation was defined. Two words $w_1, w_2 \in \Sigma^*$
are said to be $k$-abelian equivalent if for all $x \in \Sigma^{\le k}$ we have
$|w_1|_x = |w_2|_x$. Obviously, the 1-abelian equivalence relation is the same
as the abelian equivalence.

As $|w_1|_a = \binom{w_1}{a}$, another way to generalise the abelian
equivalence relation is to define the $k$-binomial equivalence (see the
conference paper as well as its journal version [12]). Two words
$w_1, w_2 \in \Sigma^*$ are said to be $k$-binomial equivalent if for all
$x \in \Sigma^{\le k}$ we have $\binom{w_1}{x} = \binom{w_2}{x}$; if $w_1$ and
$w_2$ are $k$-binomial equivalent, we write $w_1 \equiv_k w_2$. Again, it is
easy to see that the 1-binomial equivalence is the same as the abelian
equivalence. Combinatorial properties of the $k$-binomial equivalence relation
are studied in [11, 12, 10].
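
For very small $k$ the definition can be checked literally. The Python sketch
below does so by enumerating every word of length at most $k$ over the letters
of the two inputs; it takes time exponential in $k$ and is meant only as a
reference point for the algorithms discussed later, not as one of them. The
names `binomial` and `k_binomial_equivalent` are ours.

```python
from itertools import product


def binomial(u: str, v: str) -> int:
    """Number of occurrences of v as a scattered factor of u."""
    ways = [1] + [0] * len(v)
    for c in u:
        for j in range(len(v), 0, -1):
            if v[j - 1] == c:
                ways[j] += ways[j - 1]
    return ways[len(v)]


def k_binomial_equivalent(w1: str, w2: str, k: int) -> bool:
    """Brute-force test of the definition; exponential in k."""
    alphabet = sorted(set(w1) | set(w2))
    return all(
        binomial(w1, "".join(x)) == binomial(w2, "".join(x))
        for length in range(1, k + 1)
        for x in product(alphabet, repeat=length)
    )


print(k_binomial_equivalent("abba", "baab", 2))
print(k_binomial_equivalent("abba", "baab", 3))
```

The words abba and baab are 2-binomial equivalent but not 3-binomial
equivalent: aba occurs twice as a scattered factor of abba and never in baab.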

Recently, in [3, 4] a series of algorithmic results regarding the $k$-abelian
equivalence were shown. As a basic result, it was shown that one can test
whether two words are $k$-abelian equivalent in linear time. Therefore, it
seems natural to us to study a similar problem in the context of $k$-binomial
equivalence. That is, we are interested in the following problem.

Problem 1. Given $w_1, w_2 \in \Sigma^*$, with $|w_1| = |w_2| = n$, and
$k \le n$, decide whether $w_1 \equiv_k w_2$.


Our main result shows that Problem 1 can be solved in polynomial time.
The proof of this result uses a series of known results from the theory of
finite automata, which does not exploit in any way the properties of
$k$-binomial equivalence. Moreover, the degree of the polynomial characterising
the time complexity of this algorithm is rather high, so we do not give it
explicitly. Instead, we also show a simpler and much more direct Monte-Carlo
algorithm solving the same problem. Our solutions assume a basic understanding
of formal languages and automata theory; for more details, see [13] and [14].

The main motivation of studying the algorithmic properties of the $k$-binomial
equivalence relation is of fundamental nature: we have a new relation on words
and we are, naturally, interested in how we can effectively test whether two
words are equivalent with respect to this relation. Our results are also
motivated by the work done in avoidability of $k$-binomial repetitions (e.g.,
squares and cubes in […]). Constructing infinite words avoiding consecutive
occurrences of factors from the same equivalence class with respect to the
$k$-binomial equivalence often requires extensive computer simulations, whose
basic operation is testing whether two consecutive factors are equivalent. As
the words one constructs in such simulations are getting longer and longer, so
do their factors whose equivalence one needs to test; consequently, efficient
algorithms for testing the equivalence of words are required.

Before moving to the main sections of this paper, we just point out that the
complexity results we show here hold in the unit-cost RAM with logarithmic
size memory word. In this model (which is generally used in the analysis of
algorithms) we assume that, if the size of the input is $n$ (e.g., we are given
a word of length $n$), each memory cell can store $\Omega(\log n)$ bits, or, in
other words, that the machine word size is $\Omega(\log n)$. The instructions
are executed one after another, with no concurrent operations. The model
contains common instructions: arithmetic (add, subtract, multiply, divide,
remainder, shifts and bitwise operations, equality testing, etc.), data
movement (indirect addressing, load the content of a memory cell, store a
number in a memory cell, copy the content of a memory cell to another), and
control (conditional and unconditional branch, subroutine call and return).
Each such instruction takes a constant amount of time. This model allows
measuring the number of instructions executed in an algorithm, abstracting
from the time spent to execute each of the basic instructions.


2 A polynomial deterministic algorithm 


The first step we take towards solving Problem 1 is to construct, for a word
$w$, a non-deterministic finite automaton $A_w$ that accepts exactly the
scattered factors of length at most $k$ of $w$ and, moreover, has exactly
$\binom{w}{x}$ paths labelled with the scattered factor $x$ of $w$.

Let us assume that $|w| = n$; then $A_w$ has $nk + 2$ states; these states are

\[ Q_w = \{(0,0)\} \cup \{(i,j) \mid 1 \le i \le n,\ 1 \le j \le k\} \cup \{(n+1,k+1)\}. \]

The initial state of the automaton is $(0,0)$, while every state $(i,j)$ with
$0 < j \le k$ and $i \ge j$ is final. The state $(n+1,k+1)$ is an error state;
this state and the initial state are the only states that are not final.

We define the transition function $\delta_w$ for all $(i,j) \in Q_w$ and all
$a \in \Sigma$ by

\[
\delta_w((i,j), a) =
\begin{cases}
\{(\ell, j+1) \in Q_w \mid \ell > i,\ w[\ell] = a\} & \text{if this set is non-empty,} \\
\{(n+1, k+1)\} & \text{otherwise.}
\end{cases}
\]

See Figure 1 for an illustration. An immediate consequence of this definition
is that $\delta_w((n+1,k+1), a) = \{(n+1,k+1)\}$ holds for all $a \in \Sigma$.

Fig. 1. The definition of the transition function: all the transitions leaving
$(i,j)$ and leading to a non-error state, as well as all transitions going to
…. We have $(\ell, j+1) \in \delta_w((i,j), w[\ell])$, with $i < \ell \le n$
and $j < k$.

It is not hard to see that $A_w$ accepts exactly the words
$w[i_1] \cdots w[i_{k'}]$ with $k' \le k$ and $i_1 < \ldots < i_{k'}$. Indeed,
to accept such a word the automaton starts in the state $(0,0)$, and then goes
through the states

\[ (i_1, 1), (i_2, 2), \ldots, (i_j, j), \ldots, (i_{k'}, k'); \]

as $1 \le i_1 < \ldots < i_{k'}$ it is clear that $i_{k'} \ge k'$, so the state
reached by the automaton is an accepting one. For the reverse implication,
assume that the word $x$ is accepted by $A_w$ on the path formed by the states

\[ (0,0), (i_1, 1), (i_2, 2), \ldots, (i_j, j), \ldots, (i_{k'}, k'). \]

By the definition of $\delta_w$ we immediately get that $i_j < i_{j+1}$ for all
$1 \le j \le k'-1$; also, $i_1 > 0$. Thus, $i_j \ge j$ for all
$1 \le j \le k'$. Moreover, each transition ending in $(i_j, j)$ is labelled
with $w[i_j]$, so $x = w[i_1] \cdots w[i_{k'}]$ is a scattered factor of $w$.

Finally, the argument above shows that there is a bijective correspondence
between the sequences of indices defining the scattered factors of length at
most $k$ of $w$ and the paths of $A_w$. In conclusion, $A_w$ accepts the set of
scattered factors of length at most $k$ of $w$ and, moreover, has exactly as
many paths labelled with the scattered factor $x$ of $w$ as the total number of
occurrences of $x$ as a scattered factor of $w$ (i.e., $\binom{w}{x}$).
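
The correspondence between paths and occurrences can be checked mechanically.
The following Python sketch simulates the automaton described above on an
input word $x$ and counts its accepting paths. It is only an illustration
under our own conventions: the function name `count_accepting_paths` is ours,
states are represented as pairs $(i, j)$ with 1-based positions as in the
text, and paths entering the error state are simply dropped, since that state
is a non-final sink.

```python
from collections import defaultdict


def count_accepting_paths(w: str, k: int, x: str) -> int:
    """Number of accepting paths labelled x in the automaton built for w and k."""
    n = len(w)
    if not 1 <= len(x) <= k:
        return 0  # the initial state is not final, and no word longer than k is accepted
    # paths[(i, j)] = number of paths from (0, 0) to (i, j) reading the prefix so far
    paths = {(0, 0): 1}
    for a in x:
        nxt = defaultdict(int)
        for (i, j), count in paths.items():
            # transitions (i, j) --a--> (l, j + 1) for every l > i with w[l] = a
            for l in range(i + 1, n + 1):
                if w[l - 1] == a:
                    nxt[(l, j + 1)] += count
        paths = nxt
    # every state (l, j) reached here has 1 <= j <= k and l >= j, hence is final
    return sum(paths.values())


print(count_accepting_paths("bbaa", 2, "ba"))
```

For $w = bbaa$ and $x = ba$ the simulation finds 4 accepting paths, which is
the value of $\binom{bbaa}{ba}$ computed in the introduction.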

Before coming back to the solution of Problem 1 we recall that two
non-deterministic finite automata $A_1$ and $A_2$ are said to be
path-equivalent if for each word $x$ the number of distinct accepting paths
labelled with $x$ of $A_1$ equals the number of distinct accepting paths
labelled with $x$ of $A_2$, or both are infinite.

In our problem, we were given $w_1$ and $w_2$ and wanted to test whether
$w_1 \equiv_k w_2$. By the above, it is enough to construct $A_{w_1}$ and
$A_{w_2}$ and test whether $A_{w_1}$ and $A_{w_2}$ are path-equivalent. The
latter property is decidable (see [15, 14] and the references within for a
discussion on this problem and its complexity).

In the following, we show that this algorithm runs in polynomial time in our
model of computation. The construction of the two automata $A_{w_1}$ and
$A_{w_2}$ takes $O(nk)$ time. Moreover, as none of $A_{w_1}$ and $A_{w_2}$ has
transitions labelled with $\varepsilon$, it follows from [15] that there is an
algorithm deciding the path-equivalence of $A_{w_1}$ and $A_{w_2}$ that runs in
polynomial time with respect to the size of these automata (so, essentially,
with respect to $nk$). Note that the algorithm presented in [15] is only shown
to run in polynomial time in a computational model where it is assumed that
the arithmetic operations between any (no matter how big) rational numbers can
be done in constant time. To show that the algorithm still runs in polynomial
time in our model of computation, we need to go further into details.

Basically, the algorithm of [15] applied to the two automata we constructed
either decides that $A_{w_1}$ and $A_{w_2}$ are path-equivalent, or identifies
the lexicographically first word $x$ such that $A_{w_1}$ has a different number
of accepting paths labelled with $x$ than $A_{w_2}$. To do this, the algorithm
explores the set of words from $\Sigma^*$ in lexicographical order; it
maintains a list of words $V$ and for each $v \in V$ the array $P(v)$ storing
the number of accepting paths in $A_{w_1}$ and $A_{w_2}$ (that is, an array
storing for each final state of the two automata, how many paths labelled with
$v$ connect the initial state of the respective automaton to that final
state). If the list $V$ contains at some moment the words $x_1, \ldots, x_\ell$
and the new considered word is $x$, the algorithm checks if the array $P(x)$ is
linearly independent from $P(x_1), \ldots, P(x_\ell)$. If yes, $x$ is added to
$V$ and the algorithm further tries all words $xa$ with $a \in \Sigma$. If no,
the algorithm stops trying any other word that has $x$ as a prefix. In [15] it
is shown that only a polynomial number of words should be tried in this
process, since $V$ may contain up to $2nk$ words (as many words as the number
of final states of the two automata). In our particular case, it is clear that
all words that are longer than $k$ are not accepted by any of our automata
(i.e., the array $P(x)$ of some $x$ longer than $k$ contains only 0s); so,
essentially, our algorithm will only try words of length at most $k+1$. Each
such word $x$ that is accepted by one of our automata is accepted on at most
$n^\ell$ paths, where $\ell \le k$ is the length of $x$, in total. So, its
array $P(x)$ can be stored in at most $O(nk^2)$ memory words (that is, $k$
memory words for each final state, or, in other words, $k$ memory words for
each component of the array). At each step of the algorithm, we test whether
the newly considered $x$ produces an array $P(x)$ linearly independent from
the arrays $P(y)$ with $y \in V$; since all these arrays contain only numbers
that can be stored on $k$ memory words, this test can be done in polynomial
time. Indeed, if we use either a Gaussian elimination method or a modular
method, such a test can be implemented in polynomial time (see, e.g., [1] and
the references within, as well as […]). Finally, the algorithm just checks
whether there exists a word $x$ in $V$ which is accepted on a different number
of paths in $A_{w_1}$ than in $A_{w_2}$. Again, this clearly takes polynomial
time.
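
The exploration described above can be sketched in Python as follows. This is
a simplified illustration in the spirit of the procedure from [15], not a
transcription of it: words are explored breadth-first rather than in
lexicographical order, exact rational arithmetic comes from Python's
`fractions` module, the error state is omitted (it is a non-final sink), and
all function names are ours. Each $P(x)$ stores path counts for every state of
both automata, which contains the per-final-state information used in the
text.

```python
from collections import deque
from fractions import Fraction


def step(w, k, vec, a):
    """Advance the path counts vec (state -> count) of the automaton for w by letter a."""
    out = {}
    for (i, j), c in vec.items():
        if j == k:
            continue  # from level k every transition leads to the error state
        for l in range(i + 1, len(w) + 1):
            if w[l - 1] == a:
                out[(l, j + 1)] = out.get((l, j + 1), 0) + c
    return out


def flatten(v1, v2, n, k):
    """Concatenate both count maps into one rational vector P(x)."""
    size = (n + 1) * (k + 1)
    flat = [Fraction(0)] * (2 * size)
    for offset, vec in ((0, v1), (size, v2)):
        for (i, j), c in vec.items():
            flat[offset + j * (n + 1) + i] = Fraction(c)
    return flat


def reduce_against(basis, vec):
    """Gaussian elimination step: the result is zero iff vec depends on the basis."""
    for pivot, b in basis:
        if vec[pivot] != 0:
            f = vec[pivot] / b[pivot]
            vec = [x - f * y for x, y in zip(vec, b)]
    return vec


def accepted(vec):
    """Accepting paths: every reached state (i, j) with j >= 1 is final."""
    return sum(c for (i, j), c in vec.items() if j >= 1)


def path_equivalent(w1, w2, k):
    assert len(w1) == len(w2)
    n, alphabet = len(w1), sorted(set(w1) | set(w2))
    basis = []  # pairs (pivot index, reduced vector), in echelon form
    queue = deque([({(0, 0): 1}, {(0, 0): 1})])  # the empty word
    while queue:
        v1, v2 = queue.popleft()
        r = reduce_against(basis, flatten(v1, v2, n, k))
        pivot = next((idx for idx, val in enumerate(r) if val != 0), None)
        if pivot is None:
            continue  # P(x) is dependent: do not try any extension of x
        basis.append((pivot, r))
        if accepted(v1) != accepted(v2):
            return False
        for a in alphabet:
            queue.append((step(w1, k, v1, a), step(w2, k, v2, a)))
    return True


print(path_equivalent("abba", "baab", 2), path_equivalent("abba", "baab", 3))
```

The loop terminates because every word that is kept increases the dimension of
the span of the stored vectors, which is bounded by the total number of states.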

This concludes our analysis. We do not go into details and compute the
exact complexity of the algorithm described above: we just state that it runs
in polynomial time. While the preprocessing phase in which $A_{w_1}$ and
$A_{w_2}$ are constructed is rather simple, computing the complexity of the
algorithm from [15] requires really going into the implementation details of
each step (for instance, testing the linear independence of the arrays), and
this is not our purpose. We just note that the exponent of $n$ in the
complexity of this algorithm is at least 3 (in other words, the algorithm is
at least cubic in $n$). The main result of our paper is, thus, the following
theorem.

Theorem 1. Problem 1 can be solved in polynomial time.

Although based on a rather simple idea (the construction of the two automata),
the algorithm presented in this section has a drawback: the main part of the
computation is hidden in the algorithm checking the path equivalence of these
two automata. Accordingly, in the following section we present a direct and
more efficient randomised algorithm testing the $k$-binomial equivalence of
two words.

3 A Monte-Carlo algorithm 

We begin with a series of prerequisites. The first one is a folklore result; although 
it is really well known, we give a short sketch of the proof for completeness. 

Lemma 1. We can generate a number $p$ using $O(t^3)$ operations on $t$-bit
numbers, so that $p$ is a random $t$-bit prime with probability at least $1 -$ ….

Proof. We recall that, given a $t$-bit number $p$, one iteration of the
Rabin-Miller primality test [5] performs $O(t)$ operations on $t$-bit numbers,
always returns yes if $p$ is prime, and otherwise returns no with probability
at least …. We choose a random odd $t$-bit number $p$ and execute one iteration
of the Rabin-Miller test. If the test succeeds, we return $p$, and otherwise
repeat. By Theorem 2 of [2], the procedure returns a composite $p$ with
probability less than …. However, we need to modify it so that the total number
of operations is always $O(t^3)$. To this end, we simply terminate after having
tried $O(t^2)$ random $t$-bit numbers. By the prime number theorem, the
probability of a random odd $t$-bit number being prime is $\Theta(1/t)$. Hence
if we generate $O(t^2)$ such random numbers, the probability of all of them
being composite is at most … for $t$ large enough. Therefore, the total error
probability is less than … for $t$ large enough. (For smaller $t$, we can use a
naive method.) The total number of operations is now always $O(t^3)$. □
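
The procedure from the proof can be sketched in Python as follows. This is a
minimal illustration, not a vetted prime generator: the names
`miller_rabin_round` and `random_probable_prime` are ours, the cap of exactly
$t^2$ attempts is one concrete choice for the $O(t^2)$ bound, and the naive
method for small $t$ is left out (the sketch assumes $t \ge 3$).

```python
import random


def miller_rabin_round(p: int, a: int) -> bool:
    """One Rabin-Miller iteration for odd p > 2 with base a.

    Returns False only if p is certainly composite.
    """
    d, s = p - 1, 0
    while d % 2 == 0:  # write p - 1 = 2^s * d with d odd
        d //= 2
        s += 1
    x = pow(a, d, p)
    if x in (1, p - 1):
        return True
    for _ in range(s - 1):
        x = x * x % p
        if x == p - 1:
            return True
    return False


def random_probable_prime(t: int, rng=random):
    """Try at most t*t random odd t-bit numbers, one Rabin-Miller round each.

    Returns the first candidate that passes, or None if all attempts fail.
    """
    assert t >= 3
    for _ in range(t * t):
        # force the top bit (exactly t bits) and the lowest bit (odd)
        p = rng.getrandbits(t) | (1 << (t - 1)) | 1
        if miller_rabin_round(p, rng.randrange(2, p - 1)):
            return p
    return None


print(random_probable_prime(16))
```

A prime candidate passes every round, so the only possible errors are a
composite that passes its single round or a run of $t^2$ failed attempts.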

The second auxiliary result is a particular case of the Schwartz-Zippel lemma.
For a prime number $p$, let $\mathbb{F}_p$ denote the finite field on $p$
elements consisting of the integers modulo $p$. It is well known that a
non-zero polynomial $Q \in \mathbb{F}_p[x]$ of degree $d$ has at most $d$
distinct roots in $\mathbb{F}_p$. Thus, the following trivially holds:

Lemma 2. Let $Q$ be a non-zero polynomial of degree $d$ in $\mathbb{F}_p[x]$.
Then, the probability that a randomly chosen $x \in \mathbb{F}_p$ is a root of
$Q$ is at most $\frac{d}{p}$.



We now continue with the main part of this section. 

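For $v \in \{0,1\}^+$ let $\mathrm{bin}(v)$ be the number whose binary
representation is $v$. We define the crucial polynomial:

\[ Q_{k,w}(x) = \sum_{v \in \{0,1\}^+,\ |v| \le k} \binom{w}{v}\, x^{\mathrm{bin}(1v)}. \]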
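Example 1.

\[
\begin{aligned}
Q_{2,0010}(x) &= \binom{0010}{0} x^{\mathrm{bin}(10)} + \binom{0010}{1} x^{\mathrm{bin}(11)} + \binom{0010}{00} x^{\mathrm{bin}(100)} \\
&\quad + \binom{0010}{01} x^{\mathrm{bin}(101)} + \binom{0010}{10} x^{\mathrm{bin}(110)} + \binom{0010}{11} x^{\mathrm{bin}(111)} \\
&= 3x^2 + x^3 + 3x^4 + 2x^5 + x^6.
\end{aligned}
\]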

Clearly the powers of the variable x encode uniquely the scattered factors of the 
word w, consequently: 

Observation 1. $w_1 \equiv_k w_2$ if and only if $Q_{k,w_1} = Q_{k,w_2}$ over $\mathbb{Z}$.
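
For small $k$ the polynomial can be written out explicitly, which makes the
encoding easy to inspect. The Python sketch below lists its non-zero
coefficients for a binary word, using the exponent $\mathrm{bin}(1v)$ for each
scattered factor $v$; the function names are ours, and the enumeration is
exponential in $k$, which is exactly the cost the rest of this section avoids.

```python
from itertools import product


def binomial(u: str, v: str) -> int:
    """Number of occurrences of v as a scattered factor of u."""
    ways = [1] + [0] * len(v)
    for c in u:
        for j in range(len(v), 0, -1):
            if v[j - 1] == c:
                ways[j] += ways[j - 1]
    return ways[len(v)]


def explicit_Q(w: str, k: int) -> dict:
    """Non-zero coefficients of Q_{k,w} as {exponent: coefficient}, w over '0'/'1'."""
    coeffs = {}
    for length in range(1, k + 1):
        for letters in product("01", repeat=length):
            v = "".join(letters)
            c = binomial(w, v)
            if c:
                # the leading 1 keeps words with leading zeros apart
                coeffs[int("1" + v, 2)] = c
    return coeffs


print(explicit_Q("0010", 2))
```

For $w = 0010$ and $k = 2$ this yields the coefficients 3, 1, 3, 2, 1 for the
exponents 2 to 6. Two binary words are $k$-binomial equivalent exactly when
the two dictionaries coincide.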

By definition, for any word $w$ with $|w| \ge k$, the degree of $Q_{k,w}(x)$
is at most $2^{k+1} - 1$, so we cannot afford (time-wise) to construct it
explicitly for any of the words $w_1$ and $w_2$, as enumerating the
coefficients of such a polynomial would take exponential time. So, what we
should see now is how to compute efficiently
$Q_{w_1,w_2}(x) := Q_{k,w_1}(x) - Q_{k,w_2}(x)$ in $\mathbb{F}_p$; this is
solved in the next lemma, where we show how $Q_{k,w}(x)$ is computed in time
$O(nk^2)$ for a word $w$ of length $n$. In the end we will choose $p$ such that
$\log p = O(k + \log n)$. Consequently, we assume that operations on numbers in
$\mathbb{F}_p$ take $O(k)$ time, because in our model two numbers consisting of
$b \le n$ bits can be added, subtracted, multiplied, and divided in O(1 … The
bottleneck is that we cannot construct $Q_{w_1,w_2}$ explicitly, so we need to
go around this in order to compute $Q_{w_1,w_2}(x)$.

Lemma 3. For a word $w$ of length $n$, the value $Q_{k,w}(x)$ in $\mathbb{F}_p$
can be computed in $O(k^2 n)$ time.

Proof. We use an auxiliary polynomial. Let

\[ Q^{(t)}_w(x) = \sum_{|v| = t} \binom{w}{v}\, x^{\mathrm{bin}(v)}. \]

In other words $Q^{(t)}_w(x) = \sum_{\ell=0}^{2^t - 1} \alpha_{\ell,w} x^{\ell}$,
where $\alpha_{\ell,w}$ is the number of scattered occurrences of the word of
length $t$ corresponding to the binary expansion of $\ell$ in the whole word
$w$. For example if $t = 3$ and $w = 11010$, then $\alpha_{6,w} = 4$, because
$6 = 110_2$ and there are four scattered occurrences of 110 in 11010, i.e.,
$\binom{11010}{110} = 4$. It is enough to compute the polynomials $Q^{(t)}_w$
since

\[ Q_{k,w}(x) = \sum_{t=1}^{k} x^{2^t}\, Q^{(t)}_w(x). \]

The additional factor $x^{2^t}$ is needed since two different words $v, v'$ can
start with a different number of zeros, so it can be the case that
$\mathrm{bin}(v) = \mathrm{bin}(v')$ despite the fact that, actually,
$v \ne v'$.
We use dynamic programming to compute all $Q^{(k')}_{w[i..n]}(x)$, where
$0 \le k' \le k$, $1 \le i \le n+1$, and $w[i..n]$ is the suffix of $w$
starting at the $i$-th character. We denote by $T[k',i]$ the value
$Q^{(k')}_{w[i..n]}(x)$. Every such $T[k',i]$ will be computed just once and in
time $O(k)$ if we precompute all the numbers $x^{2^{k'}}$ for $1 \le k' \le k$
in $O(k^2)$ time.

Then, we just have to compute
$Q_{k,w}(x) = \sum_{k'=1}^{k} x^{2^{k'}}\, T[k',1]$, which can be done in
$O(k)$ time. Hence the claimed overall complexity will follow.

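First, we claim that the following recurrence holds:

\[
T[k',i] =
\begin{cases}
1 & \text{if } k' = 0 \\
0 & \text{if } k' > 0 \text{ and } i = n+1 \\
T[k', i+1] + T[k'-1, i+1] & \text{if } k' > 0 \text{ and } i \le n \text{ and } w[i] = 0 \\
T[k', i+1] + T[k'-1, i+1]\, x^{2^{k'-1}} & \text{if } k' > 0 \text{ and } i \le n \text{ and } w[i] = 1
\end{cases}
\]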

This is because of the following reasoning. We write every $T[k',i]$ as a
polynomial in $x$. Then, if the recurrence holds, it can be seen easily that
$T[k',i]$ is a sum, over all choices of
$i \le j_1 < j_2 < \cdots < j_{k'} \le n$, of monomials of the following form:

\[ x^{w[j_1] 2^{k'-1}} \times x^{w[j_2] 2^{k'-2}} \times \cdots \times x^{w[j_{k'}] 2^{0}}. \]

But this is the same as summing monomials of the form:

\[ x^{w[j_1] 2^{k'-1} + w[j_2] 2^{k'-2} + \cdots + w[j_{k'}] 2^{0}}. \]

Further, $w[j_1] 2^{k'-1} + w[j_2] 2^{k'-2} + \cdots + w[j_{k'}] 2^{0}$ is
really the number from $[0, 2^{k'})$ whose binary encoding is the word
$w[j_1] w[j_2] \cdots w[j_{k'}]$. Therefore, the coefficient of $x^{\ell}$ in
$T[k',i]$ is exactly the number of ways we can choose a scattered factor
$w[j_1] w[j_2] \cdots w[j_{k'}]$ of $w[i..n]$ such that
$w[j_1] w[j_2] \cdots w[j_{k'}]$ is the binary encoding of $\ell$; in other
words, this coefficient equals the number of scattered occurrences of the
binary word corresponding to $\ell$ in $w[i..n]$.

Consequently, we get that $Q^{(k')}_w(x) = T[k', 1]$, as claimed. The
conclusion of the lemma follows easily. □
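
The dynamic programme from the proof can be sketched in Python as follows. It
is an illustration under a few assumptions of ours: the word is a string over
the characters '0' and '1', the table $T$ is kept one column at a time, the
power $x^{2^0}$ is stored alongside the others, and the function name
`evaluate_Q` is hypothetical. Python integers hide the cost of arithmetic in
$\mathbb{F}_p$, so the sketch mirrors the number of operations, not the
bit-level running time.

```python
def evaluate_Q(w: str, k: int, x: int, p: int) -> int:
    """Value of Q_{k,w}(x) modulo p for a binary word w, via the table T[k', i]."""
    n = len(w)
    # powers[j] = x^(2^j) mod p for 0 <= j <= k, by repeated squaring
    powers = [x % p]
    for _ in range(k):
        powers.append(powers[-1] * powers[-1] % p)
    # column i = n + 1 of the table: T[0, n+1] = 1 and T[k', n+1] = 0 for k' > 0
    column = [1] + [0] * k
    for i in range(n, 0, -1):  # positions are 1-based as in the text
        new = [1] + [0] * k
        for j in range(1, k + 1):
            # choosing w[i] as the first of j letters contributes x^(w[i] * 2^(j-1))
            factor = powers[j - 1] if w[i - 1] == "1" else 1
            new[j] = (column[j] + column[j - 1] * factor) % p
        column = new
    # Q_{k,w}(x) = sum over k' of x^(2^k') * T[k', 1]
    return sum(powers[j] * column[j] for j in range(1, k + 1)) % p


print(evaluate_Q("0010", 2, 2, 1009))
```

For $w = 0010$, $k = 2$ and $x = 2$ the table gives $T[1,1] = 5$ and
$T[2,1] = 11$, so the result modulo 1009 is $4 \cdot 5 + 16 \cdot 11 = 196$.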

We conclude this section by putting together all the preliminary results we
have shown, to obtain the final Monte-Carlo algorithm solving Problem 1.


Randomised Algorithm

let $p$ be a random $\lceil k + 1 + 2 \log n \rceil$-bit prime

choose random $x \in \mathbb{F}_p$

compute $Q_{k,w_1}(x)$ and $Q_{k,w_2}(x)$ in $\mathbb{F}_p$

$Q_{w_1,w_2}(x) := Q_{k,w_1}(x) - Q_{k,w_2}(x)$

return YES if $Q_{w_1,w_2}(x) = 0$, NO otherwise
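
Putting the pieces together, the randomised algorithm can be sketched in
Python as follows. This is a self-contained illustration under our own
assumptions: inputs are equal-length strings over '0' and '1', the logarithm
is taken to base 2, the prime is produced by a single Rabin-Miller round per
candidate with at most $t^2$ attempts, and all function names are ours.

```python
import math
import random


def miller_rabin_round(p: int, a: int) -> bool:
    """One Rabin-Miller iteration for odd p > 2; False means certainly composite."""
    d, s = p - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    x = pow(a, d, p)
    if x in (1, p - 1):
        return True
    for _ in range(s - 1):
        x = x * x % p
        if x == p - 1:
            return True
    return False


def random_probable_prime(t: int) -> int:
    """First of at most t*t random odd t-bit candidates that passes one round."""
    for _ in range(t * t):
        p = random.getrandbits(t) | (1 << (t - 1)) | 1
        if miller_rabin_round(p, random.randrange(2, p - 1)):
            return p
    raise RuntimeError("no candidate passed the primality round")


def evaluate_Q(w: str, k: int, x: int, p: int) -> int:
    """Q_{k,w}(x) modulo p, computed column by column."""
    powers = [x % p]
    for _ in range(k):
        powers.append(powers[-1] * powers[-1] % p)
    column = [1] + [0] * k
    for letter in reversed(w):
        new = [1] + [0] * k
        for j in range(1, k + 1):
            factor = powers[j - 1] if letter == "1" else 1
            new[j] = (column[j] + column[j - 1] * factor) % p
        column = new
    return sum(powers[j] * column[j] for j in range(1, k + 1)) % p


def probably_k_binomial_equivalent(w1: str, w2: str, k: int) -> bool:
    """Monte-Carlo test: True is always returned for equivalent words."""
    n = len(w1)
    assert n == len(w2) and n >= 1
    t = max(3, math.ceil(k + 1 + 2 * math.log2(n)))
    p = random_probable_prime(t)
    x = random.randrange(p)
    return evaluate_Q(w1, k, x, p) == evaluate_Q(w2, k, x, p)


print(probably_k_binomial_equivalent("0110", "1001", 2))
```

The error is one-sided: for equivalent words the two polynomials are identical
over the integers, so the values agree for every modulus and every $x$.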
The overall time complexity is clearly polynomial both in $k$ and in $n$; as
$k \le n$, we conclude that this algorithm runs in polynomial time. More
precisely, generating a prime number requires $O(t^3)$ operations, where
$t = \lceil k + 1 + 2 \log n \rceil$. Then, we use $O(nk)$ operations to fill
the table. Therefore, the total time complexity is
$O(nk^2 + (k + 1 + \log n)^3 k)$. By considering the case $k \le \log n$ and
$k > \log n$ separately, we conclude that the total time complexity is
$O(nk^2 + k^4)$.

Now, if $w_1 \equiv_k w_2$ then $Q_{w_1,w_2}(x) = 0$ for all
$x \in \mathbb{F}_p$, so the algorithm will always return YES. Otherwise, there
are three ways it could err. First, we could have generated a composite $p$.
This happens with probability at most …. Second, it might happen that
$Q_{w_1,w_2}$ is non-zero in the integers, but vanishes in the integers modulo
$p$. By definition, the coefficients of $Q_{w_1,w_2}$ are bounded by $n^k$, so
if the polynomial is non-zero in the integers, yet vanishes modulo $p$, $p$
must be a prime divisor of a fixed number bounded by $n^k$. It is well known
that the number of distinct prime divisors of $x$, denoted $\omega(x)$,
satisfies $\omega(x) = O\left(\frac{\log x}{\log\log x}\right)$. Because there
are $\pi(2^{t+1}) - \pi(2^t) = \Theta\left(\frac{2^t}{t}\right)$ primes in the
interval $[2^t, 2^{t+1})$, for $n$ large enough this happens with probability
at most $\frac{\omega(x)}{\pi(2^{t+1}) - \pi(2^t)} = o(1)$. Third, our choice
of $x$ might have been unfortunate. By the Schwartz-Zippel lemma, this happens
with probability at most $\frac{2^{k+1}-1}{p}$. By the union bound, for large
enough $n$, the total error probability is, consequently, less than … as
required.

Theorem 2. Problem 1, for input words of length $n$, can be solved by a
Monte-Carlo algorithm with running time $O(nk^2 + k^4)$. The algorithm always
returns a positive answer when the input words $w_1$ and $w_2$ are $k$-binomial
equivalent, and returns a negative answer when $w_1$ and $w_2$ are not
$k$-binomial equivalent with probability at least $1 -$ ….

4 Conclusion 

In this paper we considered the problem of deciding whether two given words
$w_1$ and $w_2$ are $k$-binomial equivalent. We gave two polynomial algorithms
solving this problem. The first one was deterministic, and relied heavily on
a known result showing that deciding whether two non-deterministic finite
automata are path-equivalent can be done in polynomial time. The second one was
a direct algorithm whose running time, $O(nk^2 + k^4)$, is linear in the length
of the input words for fixed $k$, but it was no longer deterministic.

The main consequence of our result is that finding all the factors of
a long word which are $k$-binomial equivalent to a shorter one can also be done
in polynomial time; in other words, the problem of pattern matching under
$k$-binomial equivalence can be solved in polynomial time. Indeed, one can
check (using the algorithms presented in this paper) for all factors of the
text whether they are $k$-binomial equivalent to the pattern and return those
for which this property holds. The next theorem follows.

Theorem 3. Given two words w and x and a number k, we can find all the 
factors of w that are k-binomial equivalent to x in polynomial time. 
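
The reduction behind this theorem can be sketched in Python as follows. Since
$k$-binomial equivalent words are in particular abelian equivalent, only
factors of the same length as the pattern need to be tested. To keep the
sketch short and deterministic, the equivalence test used here is not one of
the two algorithms of this paper: it compares exact tables of
scattered-factor counts, which can take time exponential in $k$. The function
names are ours.

```python
from collections import Counter


def scattered_counts(w: str, k: int) -> Counter:
    """Occurrence counts of every non-empty scattered factor of w of length <= k."""
    counts = Counter({"": 1})
    for c in w:
        extended = Counter()
        for v, m in counts.items():
            if len(v) < k:
                # every occurrence of v before this letter extends to one of v + c
                extended[v + c] += m
        counts.update(extended)  # Counter.update adds the counts
    del counts[""]
    return counts


def find_k_binomial_matches(text: str, pattern: str, k: int) -> list:
    """Start positions (0-based) of factors of text k-binomial equivalent to pattern."""
    m = len(pattern)
    target = scattered_counts(pattern, k)
    return [
        i
        for i in range(len(text) - m + 1)
        if scattered_counts(text[i:i + m], k) == target
    ]


print(find_k_binomial_matches("abbaab", "baab", 2))
```

In the text abbaab, the factors abba (position 0) and baab (position 2) are
2-binomial equivalent to the pattern baab, while bbaa (position 1) is not.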
The main open problems remaining from this work are to find simpler and
more efficient algorithms solving Problem 1 as well as a solution to pattern
matching under $k$-binomial equivalence that does not use testing $k$-binomial
equivalence as a subroutine.